Hardware friendly fixed-point approximations of video quality metrics

ABSTRACT

A scalable hardware accelerator configured to compute video quality metrics is disclosed. In some embodiments, an accelerator for video quality metrics comprises an application-specific integrated circuit that includes an interface configured to receive pixel data of a frame of a video being analyzed for quality metric determination and a kernel configured to compute a video quality metric for the received pixel data using a fixed-point hardware approximation of a floating-point based algorithm associated with the video quality metric.

CROSS REFERENCE TO OTHER APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 63/061,692 entitled HARDWARE ACCELERATION OF VIDEO QUALITY METRICS filed Aug. 5, 2020, which is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

Video transcoding systems rely on video quality metrics for determining optimal video resolutions to serve to end user devices. Video quality metrics in existing video transcoding systems have mostly been implemented in software and have been limited to less computationally complex algorithms so that system resources are not overburdened. Thus, there exists a need for techniques that compute, at low power, complex video quality metrics that provide better measures of transcoded video quality in video transcoding systems.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a high level block diagram illustrating an embodiment of an accelerator architecture for accelerating computations of objective video quality metrics.

FIG. 2 is a high level flow chart illustrating an embodiment of a process for computing a video quality metric.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

With the advancement of digital media and growing demand for video content, video transcoding has become a common operation in data centers. Generally, video transcoding is the process of generating multiple versions of the same video in different resolutions or sizes. More specifically, a video transcoder typically comprises processing steps including receiving an input video, decoding the input video, and re-encoding the decoded input video into a plurality of qualities or resolutions (e.g., 360p, 480p, 720p, 1080p, 4K, etc.) that are persisted server-side so that optimal versions of the video may be selected and provided to different devices based on corresponding viewport sizes and/or available communication bandwidths. Transcoding an input video into a prescribed resolution may result in some quality loss in the resulting encoded video having the prescribed resolution. Moreover, scaling the encoded video having the prescribed resolution to different viewport sizes may result in further quality loss.

Quality metrics comprise a manner for measuring or quantifying quality losses resulting from transcoding an input video into an encoded video having a prescribed resolution and/or from scaling the encoded video having the prescribed resolution to a prescribed viewport size. Video transcoding applications rely on quality metrics to select an optimal version of a video for an end user device based on current capabilities of the device for receiving and displaying the video. Thus, quality metrics are determined for each of a plurality of encoded video resolutions for each of a plurality of viewport resolutions so that corresponding quality scores may be employed to select and provide an appropriate version of a video to an end user device.

Quality metrics may generally be divided into two categories: subjective quality metrics and objective quality metrics. Subjective quality metrics are determined via human test subjects, e.g., by asking users for their ratings or scores. Objective quality metrics are determined via mathematical models that facilitate computation of corresponding quality scores or values. For example, Peak Signal-to-Noise Ratio (PSNR) comprises a simple computation based on summing squared errors that has widely been used as an objective pixel quality metric. While subjective quality metrics provide better measures of true perceptual quality, determining such metrics is not scalable or even feasible for most applications. As such, several perception-based objective quality metrics have been proposed in recent years that have been correlated to human perception during testing and have evolved to closely represent subjective video quality. Examples of such perception-based objective quality metrics include Structural Similarity Index Measure (SSIM), Multi-Scale SSIM (MS-SSIM), Visual Information Fidelity (VIF), Video Multimethod Assessment Fusion (VMAF), Detail Loss Metric (DLM), etc.
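As a point of reference for the PSNR computation mentioned above, the following is a minimal software sketch rather than the disclosed hardware kernel. The function name psnr_from_frames, the flat-list frame layout, and the 8-bit peak value of 255 are assumptions made for illustration only. The sum of squared errors is kept as an exact integer, as it would be in a hardware accumulator, and only the final logarithm uses floating point.

```python
import math


def psnr_from_frames(ref, dist, max_value=255):
    """Return PSNR in dB for two equal-length lists of integer pixel values."""
    if len(ref) != len(dist) or not ref:
        raise ValueError("frames must be non-empty and equally sized")
    # Sum of squared errors; exact in integer arithmetic.
    sse = sum((r - d) * (r - d) for r, d in zip(ref, dist))
    if sse == 0:
        return math.inf  # identical frames have no measurable error
    mse = sse / len(ref)
    return 10.0 * math.log10(max_value * max_value / mse)
```

Because the per-pixel work reduces to a subtract, a multiply, and an accumulate, the metric is inexpensive compared to the perception-based metrics listed above.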

Objective quality metrics are very resource intensive since computations are performed for each pixel for each frame of a video. Moreover, computational complexity increases with increasing resolutions since computations have to be performed for more pixels. Furthermore, live applications require dynamic computations of quality metrics in real time that need to be performed without introducing significant latency. Objective quality metrics have traditionally been implemented in software, and typically only less computationally complex metrics (such as PSNR and single scale SSIM) have been employed to minimize resource consumption in an associated system. However, more computationally complex objective quality metrics offer opportunities for better quality measurements that provide more accurate indications of perceived video quality. A hardware accelerator dedicated to efficiently computing objective quality metrics is disclosed herein. The accelerator provides support not only for simpler objective quality metrics that have traditionally been implemented in software but also for more computationally complex emerging objective quality metrics. The latter have been proposed in literature but have been limited to use cases that do not have resource and/or time constraints, and they have yet to receive adoption in video transcoding systems because they have heretofore introduced unacceptable resource overheads.

FIG. 1 is a high level block diagram illustrating an embodiment of an accelerator 100 for accelerating computations of objective video quality metrics. Accelerator 100 comprises an application-specific integrated circuit for computing one or more video quality metrics and supports simultaneously computing multiple video quality metrics in parallel. Various components comprising the architecture of accelerator 100 are scalable for parallel processing based on area and power budgets available with respect to an associated system. Accelerator 100 may comprise a stand-alone component or a component of a system or device. For example, accelerator 100 may comprise an independent component of a video transcoding system that offloads resource intensive video quality metrics computations from a central processing unit of the transcoding system.

A simplified block diagram of components comprising an embodiment of accelerator 100 is illustrated in FIG. 1 for the purpose of explanation. However, generally, accelerator 100 may comprise any other appropriate combination and configuration of components to achieve the described functionality. Although many of the examples described herein are with respect to computing quality metrics for frames of a transcoded video, the disclosed techniques may be employed to compute quality metrics for any type of image data comprising pixels.

Video quality measurements may be categorized into full reference metrics, partial reference metrics, and no reference metrics. For a full reference metric, a complete reference image is available to compute distorted image quality. For a partial reference metric, partial information of a reference image such as a set of associated parameters is available to compute distorted image quality. For a no reference metric, no reference image is available, and the metric is used to establish source or upload quality. An accelerator for computing video quality metrics may generally be configured to support any combination of one or more full, partial, and/or no reference metrics. In the example of FIG. 1, accelerator 100 is specifically configured to support a plurality of full reference metrics as well as a no reference metric. A full reference metric comprises comparing a reference frame and a distorted frame on a per pixel or pixel block basis to predict perceptual quality of the distorted frame with respect to the reference frame. A reference frame comprises an original source frame prior to transcoding while a distorted frame comprises an encoded version of the reference frame after transcoding. A quality score is computed per pixel for a frame. Quality scores of pixels comprising a frame are accumulated or combined to generate a frame level score and/or a block level score for a portion of the frame. Block level scores may be useful in identifying regions of a frame that have higher impact on quality. Computed frame level and/or block level scores may be written to memory so that they are available to other components of an associated transcoding system. In some cases, frame level scores of frames comprising a prescribed video are later combined in an appropriate manner to generate a quality score for the video.

In FIG. 1, controller 102 of accelerator 100 facilitates obtaining input video frame data 101. More specifically, controller 102 of accelerator 100 facilitates obtaining frame data 101 from memory or from one or more intermediary components thereof. For example, in one embodiment, controller 102 communicates, e.g., via a double data rate (DDR) channel, with a direct memory access (DMA) interface that interfaces with physical memory. Controller 102 facilitates reading both reference frame data and distorted frame data from memory. Controller 102 furthermore coordinates or synchronizes reads of reference and distorted frame pairs to ensure that read portions of both frames are spatially aligned when input into accelerator 100 so that reference and distorted frame pairs can later be processed on a pixel by pixel basis when computing video quality metrics. Generally, optimally reading data from memory is desirable since memory bandwidth in an associated system is both a limited and expensive resource in terms of power. By supporting computations of multiple metrics simultaneously, accelerator 100 avoids the need to read the same frame multiple times from memory for different metrics, thus more optimally utilizing both bandwidth and power in an associated system.

Read input frame data 101 is loaded into one or more local input buffers 104. In some embodiments, input buffer 104 is configured in a ping pong buffer configuration in which one buffer partition is populated with data read from memory while data comprising another buffer partition is read for processing so that memory read latency can be hidden. Frame data 101 is read from memory by controller 102 and written into input buffer 104 in units of a prescribed input block size. The block size may be based on the size of input buffer 104 and/or a bandwidth supported by a corresponding on-chip network. In some embodiments, pixel blocks comprising a frame are read from memory in a raster scan order, i.e., from left to right and from top to bottom of the frame. Moreover, pixel data comprising a frame may furthermore be decoupled into luminance (luma) and interleaved chrominance (chroma) components. Accelerator 100 may generally be configured to operate on either or both the luminance and chrominance planes, which may be segregated and processed by accelerator 100 in a prescribed order and/or which may be processed in multiple passes by accelerator 100 with each plane read and processed separately.

Reference frame data 106 and distorted frame data 108 stored in input buffer 104 are read by and input into processing unit 110. That is, a portion of reference frame 106 and a corresponding portion of distorted frame 108 that each comprise a prescribed processing block size are input into processing unit 110 for processing. Processing unit 110 comprises the core processing kernel of accelerator 100. Processing unit 110 is configured to compute a plurality of video quality metrics or scores based on input frame pixel data. More specifically, processing unit 110 is configured to compute a plurality of different perception-based video quality metrics for distorted frame 108 with respect to reference frame 106. Furthermore, processing unit 110 may be configured to compute one or more other types of video quality metrics such as a PSNR metric for distorted frame 108 with respect to reference frame 106 and/or a no reference metric for reference frame 106 that indicates source or upload quality prior to transcoding. In some embodiments, processing unit 110 is configured to simultaneously compute a plurality of video quality metrics in parallel. For example, in one embodiment, processing unit 110 is configured to simultaneously compute up to three video quality metrics including a no reference quality metric, a PSNR metric, and one of a plurality of supported perception-based video quality metrics. In such cases, a selected one of a plurality of supported perception-based video quality metrics that processing unit 110 is currently configured to compute may be specified via a programming interface associated with accelerator 100. Generally, accelerator 100 may be dynamically programmed to compute any one or more supported video quality metrics and may be programmed differently for different input frames.

Video quality metrics are typically determined for a plurality of different viewport resolutions for each encoded resolution. Thus, in many cases, frame data is first scaled to a desired viewport resolution, and then video quality metrics are computed on the scaled output. Processing unit 110 comprises a plurality of programmable inline scaling units for scaling reference and distorted frame data to desired resolutions prior to computing one or more video quality metrics. More specifically, processing unit 110 comprises scaling unit 112 for scaling reference frame data 106 and scaling unit 114 for scaling distorted frame data 108. Each scaling unit may be dynamically programmed to a prescribed scaling mode (e.g., upscale, downscale, bypass) and scaling ratio or factor via an associated programming interface. Scaled outputs are not stored in memory but rather directly input into one or more processing kernels for on the fly inline computations of corresponding video quality metrics. By providing inline scaling, the architecture of accelerator 100 facilitates more efficient memory bandwidth usage in an associated system by eliminating the need to write and read scaled outputs to and from memory. Scaling units 112 and 114 may comprise any appropriate programmable scaler configurations that, for example, do not introduce any further or at least any significant quality loss during the scaling process.

Scaled frame data is processed by one or more compute kernels that are each configured to compute one or more video quality metrics. In the embodiment of FIG. 1, processing unit 110 comprises three separate hardware partitions, each with one or more kernels. The various kernels comprising processing unit 110 implement fixed-point versions of the quality metrics algorithms they are configured to implement. That is, floating-point operations specified with respect to the original algorithms are appropriately modified to fixed-point equivalents for efficient hardware implementation. The various components comprising processing unit 110 are scalable. That is, processing unit 110 may be extended to include any number of threads of scaling unit pairs and compute kernels so that a plurality of viewport resolutions may be computed in parallel, which effectively facilitates reducing the number of passes needed to compute scores for all viewport resolutions and, in turn, the number of times input frames need to be read from memory. Various details of the specific embodiment of processing unit 110 illustrated in FIG. 1 are next described to provide an example of a manner in which processing unit 110 may be configured. However, generally, processing unit 110 may comprise any other appropriate combination and configuration of components to achieve the described functionalities.

In the embodiment of FIG. 1, partition 116 of processing unit 110 is configured to compute a no reference metric that is used to establish source quality of an input reference frame. In the given example, kernel 118 comprising partition 116 is configured to compute a blurriness metric. Kernel 118 may comprise a plurality of stages. For example, in one embodiment, input reference frame data 106 is smoothened using a Gaussian blur filter in a first stage, the smoothened output from the first stage is input into a Sobel filter to compute pixel gradients in a second stage, and the output of the second stage is input into a third stage that determines edge width values associated with the spread of the edge of each pixel, which are then used to compute final blur scores.
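The second stage described above can be illustrated with a short sketch of the standard 3×3 Sobel operator, which needs only integer additions and shifts by one bit. This is an illustration under stated assumptions and not the circuit of kernel 118: the function name sobel_gradients, the list-of-rows frame layout, and the choice to leave border pixels at zero are all hypothetical, and the Gaussian and edge width stages are omitted.

```python
def sobel_gradients(frame):
    """Return (gx, gy) gradient maps for a 2-D list of integer pixels.

    Border pixels are left at zero; only interior pixels are filtered.
    """
    height, width = len(frame), len(frame[0])
    gx = [[0] * width for _ in range(height)]
    gy = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        up, mid, down = frame[y - 1], frame[y], frame[y + 1]
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column, weights 1-2-1.
            gx[y][x] = ((up[x + 1] + 2 * mid[x + 1] + down[x + 1])
                        - (up[x - 1] + 2 * mid[x - 1] + down[x - 1]))
            # Vertical gradient: bottom row minus top row, weights 1-2-1.
            gy[y][x] = ((down[x - 1] + 2 * down[x] + down[x + 1])
                        - (up[x - 1] + 2 * up[x] + up[x + 1]))
    return gx, gy
```

For a frame containing a single vertical edge, the horizontal gradient is non-zero along the edge while the vertical gradient stays at zero, which is the kind of per-pixel information an edge width stage could consume.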

Partition 120 of processing unit 110 comprises kernel 122 and kernel 124. Kernel 122 is configured to compute a PSNR (sum of squared errors) metric with respect to input reference and distorted frame data. Kernel 124 is configured to compute SSIM, for example, using an FFMPEG based algorithm, which comprises an overlapped 8×8 approximation algorithm. In SSIM, three components, namely luminance (L), contrast (C), and structure (S), based on local means, standard deviations, and cross-covariance of reference and distorted frame data are computed and combined to obtain an overall similarity measure, i.e., SSIM index.
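To make the combination of means, variances, and cross-covariance concrete, the following sketch evaluates the commonly used two-factor form of the SSIM index for one window, in which the contrast and structure terms are merged by choosing C3 = C2/2. It is an illustration only and is not asserted to match kernel 124 or any particular FFMPEG routine. The function name ssim_window and the flat-list window layout are assumptions; the constants K1 = 0.01 and K2 = 0.03 are the conventional SSIM defaults. Numerator and denominator are multiplied through by N² so that all window statistics stay in exact integer sums and a single division remains at the end.

```python
def ssim_window(ref, dist, max_value=255):
    """SSIM index of one window given as two equal-length lists of integers."""
    n = len(ref)
    if n == 0 or n != len(dist):
        raise ValueError("windows must be non-empty and equally sized")
    s1 = sum(ref)                                       # sum of reference pixels
    s2 = sum(dist)                                      # sum of distorted pixels
    ss = sum(r * r + d * d for r, d in zip(ref, dist))  # sum of squares
    s12 = sum(r * d for r, d in zip(ref, dist))         # sum of cross products
    # Stabilizing constants, pre-scaled by N^2 to match the integer sums.
    c1 = round((0.01 * max_value) ** 2 * n * n)
    c2 = round((0.03 * max_value) ** 2 * n * n)
    var_sum = n * ss - s1 * s1 - s2 * s2                # N^2 * (var_ref + var_dist)
    covar = n * s12 - s1 * s2                           # N^2 * covariance
    numerator = (2 * s1 * s2 + c1) * (2 * covar + c2)
    denominator = (s1 * s1 + s2 * s2 + c1) * (var_sum + c2)
    return numerator / denominator
```

Identical windows give an index of exactly 1.0 in this formulation, and any difference between the windows lowers it.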

Partition 126 of processing unit 110 comprises kernel 128. Kernel 128 comprises a unified kernel configured to compute single-scale (LIBVMAF) SSIM, multi-scale SSIM, as well as VIF and may be programmed to compute any one of the aforementioned metrics for a given input frame. In partition 126, distorted and reference frame data is first filtered via filter 130 and filter 132, respectively, which in some cases comprise smoothening Gaussian blur filters. The smoothened frame data output by filters 130 and 132 is then input into kernel 128 which is configured to compute LCS values of SSIM. For single scale SSIM, e.g., that is computed using an LIBVMAF based algorithm, input pixels are sent once through kernel 128. For MS-SSIM, the smoothened outputs of both frames are sent through corresponding dyadic down samplers 134 and 136 and looped back to kernel 128 to process higher scales. This process may be iterated up to a prescribed number of times corresponding to a maximum number of scales or levels supported. The feedback paths of partition 126 facilitate reuse of the same hardware to compute all scales or levels. Kernel 128 is furthermore configured to compute VIF and supports logarithmic operations needed to compute VIF scores.
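The feedback path described above can be pictured as a loop in which one kernel function is applied at every scale and both frames are halved between iterations. The sketch below is a software analogy under stated assumptions: the names dyadic_downsample and per_scale_scores are hypothetical, the down sampler is assumed to be a rounded 2×2 average, and the smoothing filters that precede the down samplers in partition 126 are omitted for brevity.

```python
def dyadic_downsample(frame):
    """Halve width and height by averaging each 2x2 block with rounding."""
    height, width = len(frame) // 2, len(frame[0]) // 2
    return [
        [
            (frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
             + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1] + 2) >> 2
            for x in range(width)
        ]
        for y in range(height)
    ]


def per_scale_scores(ref, dist, kernel, max_scales):
    """Apply the same `kernel(ref, dist)` at each scale, as a feedback loop would."""
    scores = []
    for scale in range(max_scales):
        scores.append(kernel(ref, dist))
        # Stop at the last supported scale or when the frame cannot be halved.
        if scale == max_scales - 1 or min(len(ref), len(ref[0])) < 2:
            break
        ref, dist = dyadic_downsample(ref), dyadic_downsample(dist)
    return scores
```

Passing the kernel in as a parameter mirrors the reuse of a single hardware kernel across all scales; how the per-scale values are weighted into a final MS-SSIM score is left out of this sketch.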

Software algorithms for computing video quality metrics typically comprise floating-point operations. However, such algorithms need to be converted to fixed-point representations when implemented in hardware, such as in the described accelerator architecture. Various challenges exist in converting a floating-point representation to a fixed-point representation since at least some error is introduced due to the reduced precision of the fixed-point representation. Moreover, many video quality metrics comprise complex mathematical operations that need to be simplified for viable hardware implementation.

FIG. 2 is a high level flow chart illustrating an embodiment of a process 200 for computing a video quality metric. For example, process 200 of FIG. 2 may be employed by accelerator 100 of FIG. 1 and/or one or more components thereof. At step 202, pixel data of a frame of a video being analyzed for quality metric determination is received, e.g., at an interface of a quality metrics application-specific integrated circuit or component thereof. At step 204, a video quality metric is computed for the received pixel data using a fixed-point hardware approximation of a floating-point based algorithm associated with the video quality metric, e.g., by a compute kernel of the application-specific integrated circuit.

Fixed-point approximations may give rise to numerical instabilities, such as dividing by zero or values that should be positive becoming negative due to limited precision. In some embodiments, one or more programmable numerical stability constants are introduced into a fixed-point approximation of a video quality metric algorithm to prevent numerical instability. In some embodiments, thresholds are employed for values of one or more parameters of a video quality metric algorithm to preserve numerical stability. As one example, a parameter of a video quality metric algorithm may be identified to have the mathematical property of always being positive or non-negative, and a negative value of the parameter may be clipped to zero or replaced by a (small) positive value greater than zero.
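The two safeguards described above, clipping a quantity that is known to be non-negative and adding a stability constant to a divisor, can be sketched as follows. All names (low_precision_variance, clip_non_negative, stable_ratio) and the particular rounding choices are hypothetical and chosen only to provoke the failure mode; they are not taken from the disclosed hardware.

```python
def low_precision_variance(pixels, shift):
    """E[x^2] - E[x]^2 over 2**shift pixels (shift >= 1) at integer precision.

    The mean is rounded while the mean of squares is truncated, so the
    result can dip below zero even though a true variance never does.
    """
    half = 1 << (shift - 1)
    mean = (sum(pixels) + half) >> shift
    mean_of_squares = sum(p * p for p in pixels) >> shift
    return mean_of_squares - mean * mean


def clip_non_negative(value, replacement=0):
    """Replace a negative value of a parameter known to be non-negative."""
    return replacement if value < 0 else value


def stable_ratio(numerator, denominator, epsilon, frac_bits=16):
    """Fixed-point quotient in Q(frac_bits) with a guarded divisor.

    `epsilon` plays the role of a programmable stability constant that
    keeps the divisor strictly positive.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    guarded = clip_non_negative(denominator) + epsilon
    return (numerator << frac_bits) // guarded
```

For the four pixels 2, 2, 2, 1 the low precision variance evaluates to -1, which the clip turns into 0; the guarded division then proceeds without a divide-by-zero.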

Many video quality metrics comprise complex mathematical operations such as non-linear functions. To reduce complexity, an approximation of a non-linear function may be employed to realize a fixed-point hardware implementation. As an example, consider a logarithmic function, which is employed in the computation of VIF. For a more hardware friendly implementation, a non-linear function, such as a logarithmic function, may be approximated, for example, using a k-th order polynomial or a piecewise linear function comprising a plurality of segments. In a piecewise linear function, slope and intercept values of segments comprising the piecewise linear function may be determined, for example, using a least squares method or using a Taylor expansion around a center of each segment. Segments comprising a piecewise linear function may be uniformly sampled or may be sampled in a non-uniform manner. In a non-uniform sampling of a logarithmic function, for example, more segments may be used in a region closer to an input argument having a value of one than a region closer to an input argument having a value of two. Hardware implementation may furthermore be simplified by constraining (e.g., via scaling) an input argument of a logarithmic function approximation to a value comprising a prescribed range (e.g., greater than or equal to one but less than two). In such cases, the logarithm of any input argument value may be represented by a scaled logarithm of a value comprising the constrained input argument range. Furthermore, in some embodiments, a sum of lifting values (i.e., scaling values to maintain intermediate precision) of the input argument, coefficients, and output is bounded by a prescribed value comprising a systemic constraint on a maximum number of bits supported in a corresponding system.
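One way to picture the approach described above is a base-2 logarithm built from a small table of slopes and intercepts. The sketch below is an illustration under stated assumptions, not the disclosed implementation: it uses uniformly sampled segments, takes coefficients from a first-order Taylor expansion at each segment center, and picks hypothetical bit widths (SEG_BITS, MANT_BITS, COEF_BITS). The input is normalized to the range of one to two by locating its leading one bit, so that log2(x) = exponent + log2(mantissa). The table is built offline in floating point; evaluation uses only integer shifts, one multiply, and adds.

```python
import math

SEG_BITS = 4     # 2**4 = 16 uniform segments over [1, 2) (hypothetical)
MANT_BITS = 12   # fractional bits of the normalized input (hypothetical)
COEF_BITS = 14   # fractional bits of coefficients and output (hypothetical)


def build_table():
    """Quantized (slope, intercept) pairs, one per segment of [1, 2)."""
    table = []
    for i in range(1 << SEG_BITS):
        center = 1.0 + (i + 0.5) / (1 << SEG_BITS)
        slope = 1.0 / (center * math.log(2.0))       # d/dm log2(m) at center
        intercept = math.log2(center) - slope * center
        table.append((round(slope * (1 << COEF_BITS)),
                      round(intercept * (1 << COEF_BITS))))
    return table


TABLE = build_table()


def log2_fixed(x, in_frac_bits=0):
    """Approximate log2 of a positive Q(in_frac_bits) integer, in Q(COEF_BITS)."""
    if x <= 0:
        raise ValueError("log2 is undefined for non-positive input")
    exponent = x.bit_length() - 1                    # position of the leading one
    # Normalize to a Q(MANT_BITS) mantissa in [1, 2).
    if exponent >= MANT_BITS:
        mant = x >> (exponent - MANT_BITS)
    else:
        mant = x << (MANT_BITS - exponent)
    # The top fractional bits of the mantissa select the segment.
    segment = (mant - (1 << MANT_BITS)) >> (MANT_BITS - SEG_BITS)
    slope, intercept = TABLE[segment]
    frac_log = ((slope * mant) >> MANT_BITS) + intercept
    return (exponent - in_frac_bits) * (1 << COEF_BITS) + frac_log
```

Dividing the result by 2**COEF_BITS recovers a real-valued estimate. A non-uniform variant would replace the shift-based segment selection with a comparison against segment boundaries that are denser near an input of one.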

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

What is claimed is:
1. A system, comprising: an interface of an application-specific integrated circuit configured to receive pixel data of a frame of a video being analyzed for quality metric determination; and a kernel of the application-specific integrated circuit configured to compute a video quality metric for the received pixel data using a fixed-point hardware approximation of a floating-point based algorithm associated with the video quality metric.
2. The system of claim 1, wherein the fixed-point hardware approximation comprises an approximation of a non-linear function.
3. The system of claim 2, wherein the approximation of the non-linear function comprises a piecewise linear function.
4. The system of claim 3, wherein the piecewise linear function is uniformly or non-uniformly sampled.
5. The system of claim 2, wherein the approximation of the non-linear function comprises a k-th order polynomial.
6. The system of claim 1, wherein the fixed-point hardware approximation comprises an approximation of a logarithmic function.
7. The system of claim 6, wherein an input argument of the approximation of the logarithmic function is scaled to a value comprising a prescribed range.
8. The system of claim 6, wherein the approximation of the logarithmic function comprises a piecewise linear function with more segments in a region closer to an input argument having a value of one than a region closer to a value of two.
9. The system of claim 1, wherein the fixed-point hardware approximation preserves numerical stability of the algorithm by using corresponding thresholds for one or more associated parameters of the algorithm.
10. The system of claim 1, wherein the fixed-point hardware approximation replaces a negative value of a parameter of the algorithm with a non-negative value, wherein the parameter has a mathematical property of being non-negative.
11. The system of claim 10, wherein the non-negative value is zero.
12. The system of claim 1, wherein the video quality metric comprises a no reference metric or a full reference metric.
13. The system of claim 1, wherein the video quality metric comprises a perception-based video quality metric.
14. The system of claim 13, wherein the perception-based video quality metric comprises one or more of: Structural Similarity Index Measure (SSIM), Multi-Scale SSIM (MS-SSIM), Visual Information Fidelity (VIF), Video Multimethod Assessment Fusion (VMAF), and Detail Loss Metric (DLM).
15. A method, comprising: configuring an interface of an application-specific integrated circuit to receive pixel data of a frame of a video being analyzed for quality metric determination; and configuring a kernel of the application-specific integrated circuit to compute a video quality metric for the received pixel data using a fixed-point hardware approximation of a floating-point based algorithm associated with the video quality metric.
16. The method of claim 15, wherein the fixed-point hardware approximation comprises an approximation of a non-linear function.
17. The method of claim 15, wherein the fixed-point hardware approximation preserves numerical stability of the algorithm by using corresponding thresholds for one or more associated parameters of the algorithm.
18. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: configuring an interface of an application-specific integrated circuit to receive pixel data of a frame of a video being analyzed for quality metric determination; and configuring a kernel of the application-specific integrated circuit to compute a video quality metric for the received pixel data using a fixed-point hardware approximation of a floating-point based algorithm associated with the video quality metric.
19. The computer program product of claim 18, wherein the fixed-point hardware approximation comprises an approximation of a non-linear function.
20. The computer program product of claim 18, wherein the fixed-point hardware approximation preserves numerical stability of the algorithm by using corresponding thresholds for one or more associated parameters of the algorithm.